To define temperature anomalies you need to have a reference, or base, period which NASA clearly states that it is the period between 1951-1980.
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
skip = 1,
na = "***")In the code below, we select the year and the twelve month variables from the weather dataset and convert the dataframe from wide to ‘long’ format.
tidyweather <- weather %>%
select(1:13) %>%
pivot_longer(cols = 2:13,
names_to = "Month",
values_to = "delta")Let us plot the data using a time-series scatter plot, and add a trendline.
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), Month, "1")),
month = month(date, label=TRUE),
year = year(date),
Month = factor(Month,
levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))
)
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
theme_bw() +
labs (
title = "Weather Anomalies",
caption = "Figure 1.1"
)+
theme(plot.caption = element_text(hjust = 0))Is the effect of increasing temperature more pronounced in some months?
We now plot the temperature deviation separately for each month as shown below in Figure 1.2. In recent years, there seems to be greater trend towards a higher temperature deviation.
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
facet_wrap('Month')+
theme_bw() +
labs (
title = "Weather Anomalies",
caption = "Figure 1.2"
)+
theme(plot.caption = element_text(hjust = 0))comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))Inspect the comparison dataframe by clicking on it in the Environment pane.
delta), grouped by the different time periods.ggplot(comparison, aes(x=delta, fill=interval))+
geom_density(alpha=0.2) + #density plot with tranparency set to 20%
theme_bw() + #theme
labs (
title = "Density Plot for Monthly Temperature Anomalies",
y = "Density" ,
caption="Figure 1.3"
)+
theme(plot.caption = element_text(hjust = 0))#creating yearly averages
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>%
summarise(annual_average_delta = mean(delta,na.rm=TRUE))
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
geom_point()+
#Fit the best fit line, using LOESS method
geom_smooth() +
theme_bw() +
labs (
title = "Average Yearly Anomaly",
y = "Average Annual Delta",
caption = "Figure 1.4"
) +
theme(plot.caption = element_text(hjust = 0))deltaNASA points out on their website that
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
We construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation.
formula_ci <- comparison %>%
filter(interval == "2011-present") %>%
summarise(
min_delta = min(delta,na.rm=TRUE),
max_delta = max(delta,na.rm=TRUE),
median_delta = median(delta,na.rm=TRUE),
mean_delta = mean(delta,na.rm=TRUE),
sd_delta = sd(delta,na.rm=TRUE),
count = n(),
se_delta = sd_delta/ sqrt(count),
t_critical = qt(0.975, count - 1 ),
lower = mean_delta - t_critical * se_delta,
upper = mean_delta + t_critical * se_delta
)
formula_ci %>% kbl(caption = "Table 1.1: Confidence Interval using Formula Method ") %>%
kable_classic(full_width = F, html_font = "Cambria")| min_delta | max_delta | median_delta | mean_delta | sd_delta | count | se_delta | t_critical | lower | upper |
|---|---|---|---|---|---|---|---|---|---|
| 0.46 | 1.94 | 1.04 | 1.06 | 0.276 | 132 | 0.024 | 1.98 | 1.01 | 1.11 |
library(tidyverse)
library(janitor)
library(infer)
library(here)
library(skimr)
set.seed(1234)
mean_delta <- 1.06
sd <- 0.276
n <- 127
se <- sd / sqrt(n)
lower95 <- mean_delta - qt(0.975,n-1) * se
upper95 <- mean_delta + qt(0.975,n-1) * se
bootstrap_comp <- comparison %>%
filter(interval == "2011-present",na.rm=TRUE) %>%
specify(response = delta) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
percentile_ci <-bootstrap_comp %>%
get_ci(level = 0.95, type = "percentile")
mean_deltq <- ggplot(bootstrap_comp, aes(x = stat)) +
geom_histogram() +
labs(title= "Bootstrap distribution of means",
x = "Mean Delta per bootstrap sample",
y = "Count") +
geom_vline(xintercept = percentile_ci$lower_ci, colour = 'orange', size = 2, linetype = 2) +
geom_vline(xintercept = percentile_ci$upper_ci, colour = 'orange', size = 2, linetype = 2)
visualize(bootstrap_comp) +
shade_ci(endpoints = percentile_ci,fill = "khaki")+
geom_vline(xintercept = lower95, colour = "red")+
geom_vline(xintercept = upper95, colour = "red")library(patchwork)
mean_deltqpercentile_ci## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.01 1.11
What is the data showing us?
If we were to plot the confidence intervals as shown above using both the formula and infer method, this shows that they are both fairly close to each other and means that the bootstrap simulation gives a very good result of our estimated population delta.
As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval
# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')
glimpse(approval_polllist)## Rows: 1,597
## Columns: 22
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
# Use `lubridate` to fix dates, as they are given as charactersa
clean_approval_polllist <- approval_polllist %>% mutate(
startdate = mdy(approval_polllist$startdate),
enddate = mdy(approval_polllist$enddate)
)We calculate the average net approval rate (approve- disapprove) for each week since he got into office. We plot the net approval, along with its 95% confidence interval.
The plot should look like this:
`
The confidence interval for week 4 is [5.28-23.60] in comparison to the confidence interval for week 25 which is [9.79-13.47]. The widths of these two confidence intervals are different, 18.32 for week 4 and 3.68 for week 25. This is due to the difference in sample size, as there were 8 polls in week 4 but 28 polls in week 25.
library(gapminder)
skim(gapminder)| Name | gapminder |
| Number of rows | 1704 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| country | 0 | 1 | FALSE | 142 | Afg: 12, Alb: 12, Alg: 12, Ang: 12 |
| continent | 0 | 1 | FALSE | 5 | Afr: 624, Asi: 396, Eur: 360, Ame: 300 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 1.98e+03 | 1.73e+01 | 1952.0 | 1.97e+03 | 1.98e+03 | 1.99e+03 | 2.01e+03 | ▇▅▅▅▇ |
| lifeExp | 0 | 1 | 5.95e+01 | 1.29e+01 | 23.6 | 4.82e+01 | 6.07e+01 | 7.08e+01 | 8.26e+01 | ▁▆▇▇▇ |
| pop | 0 | 1 | 2.96e+07 | 1.06e+08 | 60011.0 | 2.79e+06 | 7.02e+06 | 1.96e+07 | 1.32e+09 | ▇▁▁▁▁ |
| gdpPercap | 0 | 1 | 7.22e+03 | 9.86e+03 | 241.2 | 1.20e+03 | 3.53e+03 | 9.33e+03 | 1.14e+05 | ▇▁▁▁▁ |
# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))
# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")
library(wbstats)
worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
indicator = indicators,
start_date = 1960,
end_date = 2016)
# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels, from the World Bank API
countries <- wbstats::wb_cachelist$countriesYou have to join the 3 dataframes (life_expectancy, worldbank_data, and HIV) into one. You may need to tidy your data first and then perform join operations. Think about what type makes the most sense and explain why you chose it.
Firstly, we clean the 3 dataframes.
hiv_tidy <- hiv %>%
select(1:34) %>%
pivot_longer(cols=2:34,
names_to="year",
values_to = "hiv")%>%
drop_na()
hiv_tidy## # A tibble: 3,301 × 3
## country year hiv
## <chr> <chr> <dbl>
## 1 Afghanistan 2009 0.06
## 2 Afghanistan 2010 0.06
## 3 Afghanistan 2011 0.06
## 4 Algeria 1990 0.06
## 5 Algeria 1991 0.06
## 6 Algeria 1992 0.06
## 7 Algeria 1993 0.06
## 8 Algeria 1994 0.06
## 9 Algeria 1995 0.06
## 10 Algeria 1996 0.06
## # … with 3,291 more rows
skim(hiv_tidy)| Name | hiv_tidy |
| Number of rows | 3301 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 24 | 0 | 154 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hiv | 0 | 1 | 1.74 | 4.09 | 0.01 | 0.1 | 0.3 | 1.2 | 26.5 | ▇▁▁▁▁ |
life_expectancy_tidy <- life_expectancy %>%
select(1:302) %>%
pivot_longer(2:302,
names_to="year",
values_to="life_expectancy") %>%
drop_na()
life_expectancy_tidy## # A tibble: 55,528 × 3
## country year life_expectancy
## <chr> <chr> <dbl>
## 1 Afghanistan 1800 28.2
## 2 Afghanistan 1801 28.2
## 3 Afghanistan 1802 28.2
## 4 Afghanistan 1803 28.2
## 5 Afghanistan 1804 28.2
## 6 Afghanistan 1805 28.2
## 7 Afghanistan 1806 28.1
## 8 Afghanistan 1807 28.1
## 9 Afghanistan 1808 28.1
## 10 Afghanistan 1809 28.1
## # … with 55,518 more rows
worldbank_data_tidy <- worldbank_data %>%
mutate(year = as.character(date)) %>%
select(!date)
worldbank_data_tidy## # A tibble: 12,369 × 8
## iso2c iso3c country NY.GDP.PCAP.KD SE.PRM.NENR SH.DYN.MORT SP.DYN.TFRT.IN
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AW ABW Aruba NA NA NA 4.82
## 2 AW ABW Aruba NA NA NA 4.66
## 3 AW ABW Aruba NA NA NA 4.47
## 4 AW ABW Aruba NA NA NA 4.27
## 5 AW ABW Aruba NA NA NA 4.06
## 6 AW ABW Aruba NA NA NA 3.84
## 7 AW ABW Aruba NA NA NA 3.62
## 8 AW ABW Aruba NA NA NA 3.42
## 9 AW ABW Aruba NA NA NA 3.23
## 10 AW ABW Aruba NA NA NA 3.05
## # … with 12,359 more rows, and 1 more variable: year <chr>
The operation for joining the three dataframes should be left join, where the left dataframe is the shortest one. This allows us to make sure there are no missing values (NAs) in our new dataframe.
table_gapminder <- inner_join(left_join(left_join(hiv_tidy, worldbank_data_tidy, by = c("country" = "country", "year" = "year") ), life_expectancy_tidy, by = c("country" = "country","year" = "year")), countries, "country" = "country")
table_gapminder## # A tibble: 3,117 × 25
## country year hiv iso2c iso3c NY.GDP.PCAP.KD SE.PRM.NENR SH.DYN.MORT
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 2009 0.06 AF AFG 488. NA 91.4
## 2 Afghanistan 2010 0.06 AF AFG 543. NA 87.6
## 3 Afghanistan 2011 0.06 AF AFG 529. NA 83.9
## 4 Algeria 1990 0.06 DZ DZA 3572. 85.9 49.5
## 5 Algeria 1991 0.06 DZ DZA 3444. 87.6 48.1
## 6 Algeria 1992 0.06 DZ DZA 3424. 88.7 47
## 7 Algeria 1993 0.06 DZ DZA 3279. 88.2 45.9
## 8 Algeria 1994 0.06 DZ DZA 3183. 88.2 44.8
## 9 Algeria 1995 0.06 DZ DZA 3241. 88.4 43.6
## 10 Algeria 1996 0.06 DZ DZA 3315. 87.9 42.5
## # … with 3,107 more rows, and 17 more variables: SP.DYN.TFRT.IN <dbl>,
## # life_expectancy <dbl>, capital_city <chr>, longitude <dbl>, latitude <dbl>,
## # region_iso3c <chr>, region_iso2c <chr>, region <chr>,
## # admin_region_iso3c <chr>, admin_region_iso2c <chr>, admin_region <chr>,
## # income_level_iso3c <chr>, income_level_iso2c <chr>, income_level <chr>,
## # lending_type_iso3c <chr>, lending_type_iso2c <chr>, lending_type <chr>
table_gapminder %>%
ggplot(aes(x=hiv,y=life_expectancy))+
geom_point()+
geom_smooth(method="lm")+
facet_wrap(~region)+
labs(
title= "Relationship between HIV prevalence and life expectancy",
x= "HIV prevalence (% of population)",
y ="Life expectancy (years)"
) The graphs highlight a negative correlation between HIV prevalence and life expectancy in all regions beside Europe and Central Asia. This means that the higher the HIV prevalence, usually, the lower the life expectancy. It can be hypothesised that in Europe people are able to get treatment and medication for HIV, increasing their life expectancy.
table_gapminder %>%
ggplot(aes(x=NY.GDP.PCAP.KD,y=SP.DYN.TFRT.IN))+
geom_point()+
facet_wrap(~region)+
geom_smooth(method = "lm")+
labs(
title="Relationship between fertility rate and GDP per capita",
y="Fertility rate",
x= "GDP per capita"
) For all regions beside Europe & Central Asia and Middle East & North Africa we can highlight an inverse correlation, showcasing that the higher the GDP per capita, the lower the fertility rate. This demonstrates the connection between economic considerations and fertility choices, as poorer countries tend to have a higher fertility rate.
hiv_na <- hiv %>%
select(1:34) %>%
pivot_longer(cols=2:34,
names_to="year",
values_to = "hiv")
hiv_na_country <- left_join(hiv_na, countries, "country"= "country")
hiv_na_country %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarise(na=sum(is.na(hiv)))%>%
mutate(region=fct_reorder(region, na))%>%
ggplot(aes(x=na,y=region))+
geom_col()+
labs(
title="Number of missing HIV observations per region",
y="",
x= "Number of NA values"
) Sub-Saharan Africa and Europe & Central Asia have the highest number of NA values.
endpoints <- table_gapminder %>%
group_by(country)%>%
summarise(min_year=min(year), max_year=max(year))
endpoints## # A tibble: 143 × 3
## country min_year max_year
## <chr> <chr> <chr>
## 1 Afghanistan 2009 2011
## 2 Algeria 1990 2008
## 3 Angola 1979 2011
## 4 Argentina 1979 2011
## 5 Armenia 1990 2011
## 6 Australia 1983 2011
## 7 Austria 1981 2011
## 8 Azerbaijan 1990 2011
## 9 Bangladesh 1990 2011
## 10 Barbados 1979 2011
## # … with 133 more rows
join_min <- left_join(table_gapminder, endpoints, "country"="country") %>%
select(country, year, min_year, max_year, SH.DYN.MORT)%>%
mutate(mortality_origin = ifelse(year == min_year, SH.DYN.MORT, 0))%>%
select(!year)%>%
filter(!mortality_origin==0)%>%
select(!SH.DYN.MORT)
join_min## # A tibble: 139 × 4
## country min_year max_year mortality_origin
## <chr> <chr> <chr> <dbl>
## 1 Afghanistan 2009 2011 91.4
## 2 Algeria 1990 2008 49.5
## 3 Argentina 1979 2011 46.8
## 4 Armenia 1990 2011 48.9
## 5 Australia 1983 2011 11.9
## 6 Austria 1981 2011 15.5
## 7 Azerbaijan 1990 2011 95.2
## 8 Bangladesh 1990 2011 144.
## 9 Barbados 1979 2011 28.6
## 10 Belarus 1990 2011 15.2
## # … with 129 more rows
join_max <- left_join(table_gapminder, endpoints, "country"="country") %>%
select(country, year, min_year, max_year, SH.DYN.MORT)%>%
mutate(mortality_end = ifelse(year == max_year, SH.DYN.MORT, 0))%>%
select(!year)%>%
filter(!mortality_end==0)%>%
select(!SH.DYN.MORT)
join_max## # A tibble: 143 × 4
## country min_year max_year mortality_end
## <chr> <chr> <chr> <dbl>
## 1 Afghanistan 2009 2011 83.9
## 2 Algeria 1990 2008 29.5
## 3 Angola 1979 2011 112.
## 4 Argentina 1979 2011 13.9
## 5 Armenia 1990 2011 17.6
## 6 Australia 1983 2011 4.5
## 7 Austria 1981 2011 4.2
## 8 Azerbaijan 1990 2011 35
## 9 Bangladesh 1990 2011 46.1
## 10 Barbados 1979 2011 14.9
## # … with 133 more rows
join_minmax <- left_join(join_min, join_max, "country"="country")
join_minmax## # A tibble: 139 × 5
## country min_year max_year mortality_origin mortality_end
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 2009 2011 91.4 83.9
## 2 Algeria 1990 2008 49.5 29.5
## 3 Argentina 1979 2011 46.8 13.9
## 4 Armenia 1990 2011 48.9 17.6
## 5 Australia 1983 2011 11.9 4.5
## 6 Austria 1981 2011 15.5 4.2
## 7 Azerbaijan 1990 2011 95.2 35
## 8 Bangladesh 1990 2011 144. 46.1
## 9 Barbados 1979 2011 28.6 14.9
## 10 Belarus 1990 2011 15.2 5.1
## # … with 129 more rows
by_region <- left_join(join_minmax, countries, "country"="country") %>%
select(country,region,min_year,max_year,mortality_origin,mortality_end)
by_region## # A tibble: 139 × 6
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan South Asia 2009 2011 91.4 83.9
## 2 Algeria Middle East & N… 1990 2008 49.5 29.5
## 3 Argentina Latin America &… 1979 2011 46.8 13.9
## 4 Armenia Europe & Centra… 1990 2011 48.9 17.6
## 5 Australia East Asia & Pac… 1983 2011 11.9 4.5
## 6 Austria Europe & Centra… 1981 2011 15.5 4.2
## 7 Azerbaijan Europe & Centra… 1990 2011 95.2 35
## 8 Bangladesh South Asia 1990 2011 144. 46.1
## 9 Barbados Latin America &… 1979 2011 28.6 14.9
## 10 Belarus Europe & Centra… 1990 2011 15.2 5.1
## # … with 129 more rows
region_list = c("East Asia & Pacific" ,"Europe & Central Asia" ,"Latin America & Caribbean","Middle East & North Africa","North America","South Asia","Sub-Saharan Africa")
minmax_data <- left_join(join_minmax, countries, "country"="country") %>%
mutate(min_year = as.numeric(min_year),max_year = as.numeric(max_year),mortality_origin = as.numeric(mortality_origin),mortality_end = as.numeric(mortality_end),evolution_of_mortality = (mortality_end-mortality_origin)/mortality_origin)%>%
arrange(desc(evolution_of_mortality))%>%
select(country, region,min_year, max_year, mortality_origin, mortality_end, evolution_of_mortality)The following tables illustrate the top 5 and the lowest 5 countries according to improvement in mortality rates in each region.
East Asia & Pacific
minmax_data %>%
filter(region == region_list[1])%>%
slice_max(order_by = evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 China East Asia & Pacific 2011 2011 14.6 14.6
## 2 Fiji East Asia & Pacific 1990 2011 28.9 23.6
## 3 Papua New Guinea East Asia & Pacific 1987 2011 89.9 56
## 4 Philippines East Asia & Pacific 1990 2011 56.6 31.3
## 5 Japan East Asia & Pacific 1990 2011 6.3 3.2
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
filter(region == region_list[1])%>%
slice_max(order_by = -evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Mongolia East Asia & Pacific 1990 2011 108. 27.9
## 2 Thailand East Asia & Pacific 1985 2011 47.8 13
## 3 Cambodia East Asia & Pacific 1985 2011 120. 40.6
## 4 Singapore East Asia & Pacific 1990 2011 7.7 2.8
## 5 Australia East Asia & Pacific 1983 2011 11.9 4.5
## # … with 1 more variable: evolution_of_mortality <dbl>
Europe & Central Asia
minmax_data %>%
filter(region == region_list[2])%>%
slice_max(order_by = evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Ukraine Europe & Central Asia 1990 2011 19.4 11.2
## 2 Bulgaria Europe & Central Asia 1990 2011 18.4 10.4
## 3 Uzbekistan Europe & Central Asia 1990 2008 72.3 38.5
## 4 Moldova Europe & Central Asia 1990 2011 33.2 16.8
## 5 Switzerland Europe & Central Asia 1985 2011 9 4.5
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
filter(region == region_list[2])%>%
slice_max(order_by = -evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Portugal Europe & Central Asia 1979 2011 29.7 3.7
## 2 Greece Europe & Central Asia 1979 2011 21 3.9
## 3 Serbia Europe & Central Asia 1985 2011 39.1 7.4
## 4 Luxembourg Europe & Central Asia 1979 2011 13.9 3
## 5 Italy Europe & Central Asia 1979 2011 17 3.9
## # … with 1 more variable: evolution_of_mortality <dbl>
Latin America & Caribbean
minmax_data %>%
filter(region == region_list[3])%>%
slice_max(order_by = evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Costa Rica Latin Am… 1990 2011 16.9 10.2
## 2 Trinidad and Tobago Latin Am… 1982 2011 36.9 22.2
## 3 Barbados Latin Am… 1979 2011 28.6 14.9
## 4 Guyana Latin Am… 1979 2011 70.5 36.5
## 5 Belize Latin Am… 1990 2011 38.3 18.3
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
filter(region == region_list[3])%>%
slice_max(order_by = -evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Peru Latin America & … 1979 2011 129 18.9
## 2 Brazil Latin America & … 1979 2011 100. 17.9
## 3 Ecuador Latin America & … 1979 2011 96.1 17.5
## 4 El Salvador Latin America & … 1985 2011 79.8 18.3
## 5 Uruguay Latin America & … 1979 2011 44 10.3
## # … with 1 more variable: evolution_of_mortality <dbl>
Middle East & North Africa
minmax_data %>%
filter(region == region_list[4])%>%
slice_max(order_by = evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Malta Middle East & North Africa 1990 2011 11.4 6.8
## 2 Algeria Middle East & North Africa 1990 2008 49.5 29.5
## 3 Djibouti Middle East & North Africa 1983 2011 142. 74.1
## 4 Qatar Middle East & North Africa 1990 2008 20.8 9.6
## 5 Morocco Middle East & North Africa 1990 2011 79.1 30.4
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
filter(region == region_list[4])%>%
slice_max(order_by = -evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Lebanon Middle East & North Africa 1979 2011 51.5 9.8
## 2 Israel Middle East & North Africa 1979 2011 19.1 4.4
## 3 Oman Middle East & North Africa 1990 2008 38.7 12
## 4 Tunisia Middle East & North Africa 1990 2011 55.3 18
## 5 Morocco Middle East & North Africa 1990 2011 79.1 30.4
## # … with 1 more variable: evolution_of_mortality <dbl>
North America
minmax_data %>%
filter(region == region_list[5])%>%
slice_max(order_by = evolution_of_mortality,n = 5)## # A tibble: 2 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 United States North America 1979 2011 15.6 7.2
## 2 Canada North America 1979 2011 13.3 5.6
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
filter(region == region_list[5])%>%
slice_max(order_by = -evolution_of_mortality,n = 5)## # A tibble: 2 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Canada North America 1979 2011 13.3 5.6
## 2 United States North America 1979 2011 15.6 7.2
## # … with 1 more variable: evolution_of_mortality <dbl>
South Asia
minmax_data %>%
filter(region == region_list[6])%>%
slice_max(order_by = evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan South Asia 2009 2011 91.4 83.9
## 2 Pakistan South Asia 1990 2011 140. 85
## 3 Sri Lanka South Asia 1990 2011 21.9 11.2
## 4 India South Asia 1990 2009 126. 61.4
## 5 Bangladesh South Asia 1990 2011 144. 46.1
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
filter(region == region_list[6])%>%
slice_max(order_by = -evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Maldives South Asia 1990 2011 85.8 13
## 2 Nepal South Asia 1985 2011 175. 44.2
## 3 Bhutan South Asia 1990 2011 127 39.9
## 4 Bangladesh South Asia 1990 2011 144. 46.1
## 5 India South Asia 1990 2009 126. 61.4
## # … with 1 more variable: evolution_of_mortality <dbl>
Sub-Saharan Africa
minmax_data %>%
filter(region == region_list[7])%>%
slice_max(order_by = evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 South Sudan Sub-Sa… 2011 2011 102. 102.
## 2 Lesotho Sub-Sa… 1984 2011 101. 96.1
## 3 Ethiopia Sub-Sa… 2009 2011 86.7 77.6
## 4 Sao Tome and Principe Sub-Sa… 2009 2011 47.9 42.4
## 5 Somalia Sub-Sa… 1987 2011 186 151.
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
filter(region == region_list[7])%>%
slice_max(order_by = -evolution_of_mortality,n = 5)## # A tibble: 5 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Rwanda Sub-Saharan Africa 1979 2011 233. 57.1
## 2 Eritrea Sub-Saharan Africa 1985 2011 178. 53.2
## 3 Malawi Sub-Saharan Africa 1980 2011 256. 78.6
## 4 Senegal Sub-Saharan Africa 1981 2011 198. 63
## 5 Uganda Sub-Saharan Africa 1979 2011 216. 72.5
## # … with 1 more variable: evolution_of_mortality <dbl>
How has mortality rate for under 5 changed by region?
evolution_change <- by_region %>%
mutate(min_year = as.numeric(min_year),max_year = as.numeric(max_year),mortality_origin = as.numeric(mortality_origin),mortality_end = as.numeric(mortality_end),evolution_of_mortality = (mortality_end-mortality_origin)/mortality_origin)%>%
arrange(desc(evolution_of_mortality))
evolution_change## # A tibble: 139 × 7
## country region min_year max_year mortality_origin mortality_end
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 China East … 2011 2011 14.6 14.6
## 2 South Sudan Sub-S… 2011 2011 102. 102.
## 3 Lesotho Sub-S… 1984 2011 101. 96.1
## 4 Afghanistan South… 2009 2011 91.4 83.9
## 5 Ethiopia Sub-S… 2009 2011 86.7 77.6
## 6 Sao Tome and Principe Sub-S… 2009 2011 47.9 42.4
## 7 Fiji East … 1990 2011 28.9 23.6
## 8 Somalia Sub-S… 1987 2011 186 151.
## 9 Mauritania Sub-S… 1990 2011 118. 93.7
## 10 Central African Republic Sub-S… 1979 2011 188. 143.
## # … with 129 more rows, and 1 more variable: evolution_of_mortality <dbl>
evolution_by_region <- evolution_change %>%
group_by(region) %>%
summarise(mean_mortality_evol = mean(evolution_of_mortality))
evolution_by_region## # A tibble: 7 × 2
## region mean_mortality_evol
## <chr> <dbl>
## 1 East Asia & Pacific -0.514
## 2 Europe & Central Asia -0.651
## 3 Latin America & Caribbean -0.633
## 4 Middle East & North Africa -0.598
## 5 North America -0.559
## 6 South Asia -0.555
## 7 Sub-Saharan Africa -0.455
evolution_by_region %>%
ggplot(aes(y = reorder(region, --mean_mortality_evol), x=100*mean_mortality_evol))+
geom_col()+
labs(
title= "Change in mortality rate by region",
x= "Change in mortality rate (%)",
y=""
) The biggest change in mortality rate can be noted in Europe and Central Asia, over -60%, meaning that the rate has gone down. The least difference can be noticed in Sub-Saharan Africa.
table_gapminder %>%
group_by(country) %>%
ggplot(aes(x=SE.PRM.NENR,y=SP.DYN.TFRT.IN))+
geom_point()+
facet_wrap(~region)+
geom_smooth(method = "lm")+
labs(
title="Relationship between fertility rate and primary school enrollment",
y="Fertility rate",
x= "Primary school enrollment, (%) population"
) The higher the primary school enrollment in all regions but Europe and Central Asia, the lower the fertility rate.
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-08-23T14%3A32%3A29/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210911%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210911T133924Z&X-Amz-Expires=300&X-Amz-Signature=e887fc183fd2074150a05ea30ae23e430d8c0026591c85b86213f78ef9d4f1e1&X-Amz-SignedHeaders=host]
## Date: 2021-09-11 13:39
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 173 kB
## <ON DISK> /var/folders/91/dm_fdgkx0l56y1cm7zy73wj40000gn/T//RtmpokIBPw/file86f70e20191.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
week = isoweek(day))It’s interesting to note that during the 2020 lockdown, rentals decreased by half in comparison to other years.
The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).
We used the mean to calculate our expected rentals as when we tested using the median it was quite different from the actual numbers.
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.